Parquet to BigQuery import for GCP-backed AnVIL snapshots (#6355) #6392
base: develop
Conversation
Codecov Report
Attention: Patch coverage is low; most of the added lines are not covered (see diff below).
Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #6392    +/-  ##
===========================================
- Coverage    85.38%   85.12%   -0.26%
===========================================
  Files          155      155
  Lines        20754    20823      +69
===========================================
+ Hits         17720    17725       +5
- Misses        3034     3098      +64
Force-pushed ccd127d to 09c7a41
scripts/verify_tdr_sources.py
Outdated
'Actual Google project of TDR source differs from configured one',
source.project, source_spec.project)
source.project, source_spec.project, config.google_project())
Suggested change:
- source.project, source_spec.project, config.google_project())
+ source_spec.project, source.project, config.google_project())
to match the order of these variables in the condition part of the require() call
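As an aside, the convention being asked for here can be illustrated with a minimal sketch. The require shown below is a simplified stand-in for Azul's actual helper, and the project names are made up:

```python
def require(condition: bool, *args):
    # Simplified stand-in for Azul's require() helper: raise with the
    # supplied context values when the condition does not hold.
    if not condition:
        raise RuntimeError(*args)


configured, actual = 'platform-anvil-dev', 'datarepo-dev-e53e74aa'
try:
    # The message arguments mirror the operand order in the condition:
    # the configured project first, then the actual one.
    require(configured == actual,
            'Actual Google project of TDR source differs from configured one',
            configured, actual)
except RuntimeError as e:
    print(e.args[1:])  # the context values, in condition order
```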
I decided on a different approach here to make the diff clearer and hopefully make the intent more apparent.
Force-pushed 09c7a41 to 0d1c70f
I also moved some changes from the first commit to the second, hence the other fixup.
Approved.
Makefile
Outdated
python $(project_root)/scripts/reindex.py --import --sources "tdr:${GOOGLE_PROJECT}:snapshot/*"
python $(project_root)/scripts/verify_tdr_sources.py
I think these two commands should be extracted to a separate make target called import, and a corresponding GitLab job. We also need to think about sandbox and personal deployments that typically share the sources with a main deployment. I think #6426 will help with this, which is why I've added it as a blocker of #6355.
scripts/download_tdr_parquet.py
Outdated
@@ -0,0 +1,103 @@
"""
Export parquet files from TDR and download them to local storage.
Suggested change:
- Export parquet files from TDR and download them to local storage.
+ Export Parquet files from TDR and download them to local storage.
Here and elsewhere in the documentation.
Force-pushed 0d1c70f to b3d454f
Force-pushed 6c01bba to 2168ed5
Force-pushed 2168ed5 to aae4509
Force-pushed 4589e62 to 1a0a33a
Force-pushed aae4509 to 5195059
Force-pushed 192d97a to a5f7692
Force-pushed 6e7481e to be2dfe3
Force-pushed 084a127 to 14df96c
Something's amiss here. It appears that this adds code that isn't actually exercised, because none of the sources are configured to use Parquet. We should switch one source in each of dev, sandbox, anvildev and anvilbox to use Parquet. Let's also discuss in PL what to do about personal deployments.
Subject: [PATCH] make fo
---
Index: src/azul/terra.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/terra.py b/src/azul/terra.py
--- a/src/azul/terra.py (revision 14df96cd41cd529093df1520abd73d725b44aa28)
+++ b/src/azul/terra.py (date 1724981432852)
@@ -541,6 +541,7 @@
endpoint: furl,
response: urllib3.HTTPResponse
) -> MutableJSON:
+ # REVIEW: Short comment here explaining when we'd expect a 202
if response.status in (200, 202):
return json.loads(response.data)
# FIXME: Azul sometimes conflates 401 and 403
@@ -661,8 +662,8 @@
def create_dataset(self, dataset_name: str):
"""
- Create a BigQuery dataset in the project and region configured for the
- current deployment.
+ Create a BigQuery dataset in the GCP project associated with the current
+ credentials and the GCP region configured for the current deployment.
:param dataset_name: Unqualified name of the dataset to create.
`google.cloud.exceptions.Conflict` will be raised
@@ -670,10 +671,23 @@
"""
bigquery = self._bigquery(self.credentials.project_id)
ref = DatasetReference(bigquery.project, dataset_name)
+ # We get a false warning from PyCharm here, probably because of
+ #
+ # https://youtrack.jetbrains.com/issue/PY-23400/regression-PEP484-type-annotations-in-docstrings-nearly-completely-broken
+ #
+ # Google uses the docstring syntax to annotate types in its BQ client.
+ #
+ # noinspection PyTypeChecker
dataset = Dataset(ref)
+ # REVIEW: This changes the meaning of AZUL_TDR_SOURCE_LOCATION somewhat.
+ # While I don't think we need to introduce a new variable, we
+ # should document the new semantics so that someone modifying
+ # it is aware of the implications.
dataset.location = config.tdr_source_location
log.info('Creating BigQuery dataset %r in region %r',
dataset.dataset_id, dataset.location)
+ # REVIEW: This method returns something. Let's assert key aspects of the
+ # return value.
bigquery.create_dataset(dataset)
def create_table(self,
@@ -692,8 +706,11 @@
:param table_name: Unqualified name of the new table
+ REVIEW: Technically gs://… is a URI. If "URL" is TDR lingo I'm happy to
+ adopt it, otherwise we should use "URI".
+
:param import_urls: URLs of Parquet file(s) to populate the table. These
- must be `gs://` URLS and the GCS bucket's region
+ must be `gs://` URLs and the GCS bucket's region
must be compatible with the target dataset's. See
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#limitations
@@ -705,7 +722,7 @@
https://cloud.google.com/bigquery/docs/clustered-tables
"""
for url in import_urls:
- require(url.scheme == 'gs', url)
+ require(url.scheme == 'gs', 'Expected gs:// URI', url)
table_id = f'{dataset_name}.{table_name}'
bigquery = self._bigquery(self.credentials.project_id)
write_disposition = (
@@ -716,25 +733,26 @@
clustering_fields=clustering_fields,
source_format=SourceFormat.PARQUET,
# Avoids convoluted data types for array fields
+ # REVIEW: Please elaborate
parquet_options=ParquetOptions.from_api_repr(dict(enable_list_inference=True))
)
- log.info('Creating BigQuery table %r',
- f'{bigquery.project}.{dataset_name}.{table_name}')
+ table_ref = f'{bigquery.project}.{table_id}'
+ log.info('Creating BigQuery table %r', table_ref)
load_job = bigquery.load_table_from_uri(source_uris=list(map(str, import_urls)),
destination=table_id,
job_config=job_config)
load_job.result()
- log.info('Table created successfully')
+ log.info('Table %r created successfully', table_ref)
def export_parquet_urls(self,
snapshot_id: str
) -> Optional[dict[str, list[mutable_furl]]]:
"""
Obtain URLs of Parquet files for the data tables of the specified
- snapshot. This is an time-consuming operation that usually takes on the
- order of 1 minute to complete.
+ snapshot. This is a time-consuming operation that usually takes on the
+ order of one minute to complete.
- :param snapshot_id: The UUID of the snapshot.
+ :param snapshot_id: The UUID of the snapshot
:return: A mapping of table names to lists of Parquet file download
URLs, or `None` if no Parquet downloads are available for
@@ -744,6 +762,8 @@
url = self._repository_endpoint('snapshots', snapshot_id, 'export')
# Required for Azure-backed snapshots
url.args.add('validatePrimaryKeyUniqueness', False)
+ # REVIEW: We should apply a timeout here. I suggest five times the
+ # longest observed duration
while True:
response = self._request('GET', url)
response_body = self._check_response(url, response)
@@ -752,6 +772,8 @@
if jobs_status == 'running':
url = self._repository_endpoint('jobs', job_id)
log.info('Waiting for job %r ...', job_id)
+ # REVIEW: What's this choice of two seconds based on? It seems
+ # rather short considering everything about TDR's robustness.
time.sleep(2)
elif jobs_status == 'succeeded':
break
Index: .gitlab-ci.yml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
--- a/.gitlab-ci.yml (revision 14df96cd41cd529093df1520abd73d725b44aa28)
+++ b/.gitlab-ci.yml (date 1724981655861)
@@ -100,6 +100,8 @@
import:
extends: .base_on_push
stage: deploy
+ # REVIEW: Assume all sources need to be imported, estimate the expected
+ # running time and double that. Remove the comment.
timeout: 5m # probably needs to be extended
needs:
- build_image
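The REVIEW notes in the patch above ask for an overall timeout and a less aggressive polling interval in export_parquet_urls. One possible shape for that, sketched outside of Azul's actual code (the exception name, the ten-minute ceiling and the 30-second interval are illustrative assumptions, not values from this PR):

```python
import time
from typing import Callable


class ExportTimeout(Exception):
    """Raised when a TDR export job does not finish within the deadline."""


def wait_for_job(poll_status: Callable[[], str],
                 timeout: float = 600.0,
                 interval: float = 30.0) -> None:
    # Give up after `timeout` seconds; the review suggests deriving this
    # from a multiple of the longest observed job duration.
    deadline = time.monotonic() + timeout
    while True:
        status = poll_status()
        if status == 'succeeded':
            return
        elif status == 'running':
            if time.monotonic() >= deadline:
                raise ExportTimeout(f'Job still running after {timeout}s')
            # A generous interval: TDR export jobs take on the order of
            # a minute, so sub-second polling buys nothing.
            time.sleep(interval)
        else:
            raise RuntimeError(f'Unexpected job status: {status!r}')
```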
Force-pushed 2b088c8 to d7ebee7
Force-pushed d7ebee7 to 65fc638
@hannes-ucsc: "Regarding personal deployments, the import should only be performed on shared deployments. By the time developers upgrade their personal deployments to mirror the respective sandbox, the GitLab build for that sandbox will have already imported the snapshot. If someone really needs to import a snapshot for a personal deployment, they can temporarily enable the import for personal deployments."
Force-pushed 65fc638 to 78545b6
Conflicts.
import:
  extends: .base_on_push
  stage: deploy
  # The 1000G snapshot on `anvildev` takes about 3.5 minutes to import. There
I'm not sure that assuming every snapshot is as big as 1000G leads to a practical timeout.
A timeout is a heuristic defense against hung workloads, i.e., workloads that stop making significant progress. We don't want to constantly update the timeout, we don't want it to prematurely kill workloads that are progressing at the average rate, and we don't want the workload to be in the hung state for more than 80% of its running time. A five-minute timeout goes against the first rule; a 30-hour timeout goes against the last.
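Plugging the numbers from this thread into that heuristic (the 3.5-minute observation from the CI comment above, and the earlier suggestion of five times the longest observed duration; both factors are illustrative, not a decision):

```python
# Back-of-the-envelope timeout per the heuristic discussed above.
longest_observed_minutes = 3.5  # 1000G snapshot import on anvildev
safety_factor = 5               # multiple suggested earlier in this review

timeout_minutes = longest_observed_minutes * safety_factor
print(f'timeout: {timeout_minutes:g}m')  # well between the 5m and 30h extremes
```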
@@ -73,7 +73,7 @@ def mkdict(previous_catalog: dict[str, str],
anvil_sources = mkdict({}, 3, mkdelta([
    mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
The source spec should state where the source is, not where it will be once it is imported. The logic should be to import every Parquet source. So I think this should read:
Suggested change:
- mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
+ mksrc('bigquery', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
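To make the intent concrete, here is a sketch of the selection logic that comment implies: import every source whose spec says its data currently lives in Parquet. The mksrc below is a dictionary-returning stand-in for Azul's actual helper, not its real implementation:

```python
def mksrc(style: str, project: str, snapshot: str, subgraphs: int) -> dict:
    # Stand-in for Azul's source-spec helper: just records the fields.
    return dict(style=style, project=project,
                snapshot=snapshot, subgraphs=subgraphs)


sources = [
    mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
]

# The importer picks up every Parquet-style source, regardless of which
# project the data will land in after the import.
to_import = [s for s in sources if s['style'] == 'parquet']
print([s['project'] for s in to_import])
```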
@@ -64,7 +64,7 @@ def mkdict(previous_catalog: dict[str, str],
anvil_sources = mkdict({}, 3, mkdelta([
    mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
Suggested change:
- mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
+ mksrc('parquet', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
endef

$(eval $(call deploy,))
$(eval $(call deploy,auto_))

.PHONY: import
import: check_python
	python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:${GOOGLE_PROJECT}:*"
Suggested change:
- python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:${GOOGLE_PROJECT}:*"
+ python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:*"
Connected issues: #6355